Domain: Banking

This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better targeting to increase the success ratio on a minimal budget.

The dataset contains data on 5,000 customers: demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Among these 5,000 customers, only 480 (9.6%) accepted the personal loan offered in the earlier campaign.

Attribute Information:
import warnings
warnings.filterwarnings('ignore')
#Load Libraries
#!pip install pandas_profiling
import pandas as pd
import pandas_profiling
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix,recall_score, precision_score, f1_score, roc_auc_score,accuracy_score
import matplotlib.pyplot as plt
%matplotlib inline
# Remove scientific notations and display numbers with 2 decimal points instead
pd.options.display.float_format = '{:,.2f}'.format
#Load data
thera_df = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
thera_df.head()
# clean column names
thera_df.columns = ["ID","Age","Experience","Income","ZIPCode","Family","CCAvg","Education","Mortgage","PersonalLoan","SecuritiesAccount","CDAccount","Online","CreditCard"]
thera_df.shape
thera_df.dtypes
thera_df.isna().sum()
thera_df.describe().transpose()
# Number of unique values in each column
thera_df.nunique()
#Number of people with zero mortgage
len(thera_df[thera_df.Mortgage==0])
#Number of people with zero credit card spending per month?
len(thera_df[thera_df.CCAvg==0])
#Value counts of all categorical columns.
#family and Education are categorical
thera_df.Education.value_counts()
thera_df.Family.value_counts()
#Univariate and Bivariate
thera_df.profile_report()
# 'Experience' has negative values; mark them as missing
thera_df.loc[(thera_df['Experience'] < 0), 'Experience'] = np.nan
# replace by median
thera_df['Experience'].fillna(thera_df['Experience'].median(),inplace=True)
len(thera_df[thera_df['Experience'] < 0])
thera_df.columns
cont_cols=['Age','Experience','Income','CCAvg','Mortgage']
thera_df[cont_cols]
indx = 1
plt.figure(figsize=(40.5, 40.5))
for col in cont_cols:
    plt.subplot(5, 2, indx)
    plt.hist(thera_df[col])
    plt.xlabel(col)
    indx = indx + 1
    plt.subplot(5, 2, indx)
    sns.boxplot(x=thera_df[col])
    indx = indx + 1
Age appears roughly normally distributed, with the majority of customers between 35 and 55 years of age.
Experience is also roughly normally distributed, with Q1 and Q3 at about 11 and 30 years.
Income, CCAvg and Mortgage are highly right-skewed.
Mortgage and CCAvg have significant outliers.
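The skewness claims above can be checked numerically with `Series.skew()`. A minimal self-contained sketch, using synthetic stand-in columns since the bank CSV is not bundled here (on the real data the check would simply be `thera_df[cont_cols].skew()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: a roughly symmetric column and a right-skewed one
demo = pd.DataFrame({
    "Age": rng.normal(45, 10, 5000),        # roughly normal, like Age
    "Income": rng.lognormal(4, 0.8, 5000),  # long right tail, like Income/CCAvg
})
skews = demo.skew()
# Skewness near 0 suggests symmetry; large positive values mean a long right tail
print(skews)
```

Values well above 1 back the "highly skewed" observation and suggest a log transform or robust scaling may help some models, though tree-free linear models as used below are the main concern here.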
cat_cols=['Family',
'Education', 'SecuritiesAccount',
'CDAccount', 'Online', 'CreditCard']
indx = 1
plt.figure(figsize=(30, 50))
for col in cat_cols:
    plt.subplot(6, 2, indx)
    thera_df[col].value_counts().plot(kind="bar", align='center', edgecolor='black')
    plt.xlabel(col)
    indx = indx + 1
1. Family and Education are fairly evenly distributed.
2. CDAccount, CreditCard and SecuritiesAccount are heavily imbalanced, with far more customers not holding them than holding them.
3. Online banking usage also shows a significant difference between the two groups.
top5zipcode = thera_df[thera_df['PersonalLoan']==1]['ZIPCode'].value_counts().head(5)
top5zipcode
thera_df.columns
sns.countplot(x="CDAccount", data=thera_df,hue="PersonalLoan")
sns.countplot(x="CreditCard", data=thera_df,hue="PersonalLoan")
sns.distplot( thera_df[thera_df['PersonalLoan'] == 0]['CCAvg'], color = 'r')
sns.distplot( thera_df[thera_df['PersonalLoan'] == 1]['CCAvg'], color = 'g')
Customers with higher average credit card spending were more likely to accept the personal loan.
sns.distplot( thera_df[thera_df.PersonalLoan == 0]['Income'], color = 'r')
sns.distplot( thera_df[thera_df.PersonalLoan == 1]['Income'], color = 'g')
Customers with higher income were more likely to accept the personal loan.
pairplt = sns.pairplot(thera_df[['Age','Experience','Income','ZIPCode','Family','CCAvg' ,'Education' , 'Mortgage','PersonalLoan','SecuritiesAccount','CDAccount','Online','CreditCard']] )
# Corelation comparison
plt.figure(figsize=(25, 25))
ax = sns.heatmap(thera_df.corr(), fmt='.2f', annot=True, linecolor='white',linewidths=0.01,square=True)
plt.title('Correlation')
plt.show()
The dataset has no missing or null values.
The 'Experience' column had negative values, which were replaced with the median.
Correlation:
We can drop the 'ID' and 'Experience' columns for further analysis: 'ID' is just a running index, and 'Experience' is highly correlated with 'Age'.
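The Age–Experience redundancy can be verified directly with `Series.corr()`. A self-contained sketch with synthetic data, since the CSV is assumed unavailable here (on the real data: `thera_df['Age'].corr(thera_df['Experience'])`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
age = rng.integers(23, 67, size=5000)
# Experience modeled as age minus schooling years plus small noise,
# mimicking the near-mechanical link between the two real columns
experience = age - 22 + rng.normal(0, 1, size=5000)
demo = pd.DataFrame({"Age": age, "Experience": experience})
r = demo["Age"].corr(demo["Experience"])
print(f"corr(Age, Experience) = {r:.3f}")
# A value close to 1 means the columns carry almost the same information,
# which is why one of them can be dropped before modeling.
```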
#Get data model ready
df = thera_df.copy()
thera_df.head()
df.head(2)
# Drop 'ID' & 'Experience' and move 'PersonalLoan' to the last column
y= df['PersonalLoan']
df.drop(['PersonalLoan'], axis = 1,inplace = True)
df.drop(['Experience'], axis = 1,inplace = True)
df['PersonalLoan'] = y
df.drop(['ID'], axis = 1,inplace = True)
df.head(2)
#Look at the data distribution
df.describe().transpose()
X = df.iloc[:, 0:11]
y = df.iloc[:, 11]
# Convert categorical variables to dummy variables.
# Family and Education are stored as integers, so list them explicitly;
# otherwise get_dummies would leave numeric columns unchanged.
X = pd.get_dummies(X, columns=['Family', 'Education'], drop_first=True)
X.head(2)
y.head(2)
print('df', df.shape)
print('X', X.shape)
print("y",y.shape)
X.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print('X_train', X_train.shape)
print("X_test",X_test.shape)
print('y_train', y_train.shape)
print("y_test",y_test.shape)
## function to plot the confusion matrix as a labelled heatmap
def draw_cm(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=[0, 1], yticklabels=[0, 1])
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on train set: {:.2f}'.format(logreg.score(X_train, y_train)))
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
print()
print('Confusion Matrix')
draw_cm(y_test, y_pred)
print()
print("Recall:",recall_score(y_test,y_pred))
print()
print("Precision:",precision_score(y_test,y_pred))
print()
print("F1 Score:",f1_score(y_test,y_pred))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_pred))
df_confusion_matrix = confusion_matrix(y_test, y_pred)
print(df_confusion_matrix)
print(classification_report(y_test, y_pred))
#AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])  # use probabilities, not hard predictions
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
# Checking Parameters of logistic regression
logreg.get_params()
Hint: check the model parameters. Consider dropping 'ZIPCode' as well, since it is essentially an identifier rather than a predictive feature.
Here recall is more important than accuracy. As recall is low at 32% and false negatives are high at 107, we will try to tune the logistic regression parameters.
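Besides tuning solver/penalty/C, recall can also be raised by lowering the decision threshold on `predict_proba` (the default cutoff used by `predict` is 0.5). A self-contained sketch on synthetic imbalanced data; the 0.3 cutoff and variable names are illustrative, not from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data (~10% positives, similar to the 9.6% loan acceptance rate)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.9, 0.1], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=42, stratify=y_demo)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
default_recall = recall_score(yte, clf.predict(Xte))                # 0.5 cutoff
low_cut_pred = (clf.predict_proba(Xte)[:, 1] >= 0.3).astype(int)    # lower cutoff
tuned_recall = recall_score(yte, low_cut_pred)
print(f"recall @0.5 = {default_recall:.3f}, recall @0.3 = {tuned_recall:.3f}")
# Lowering the threshold can only add positive predictions, so recall
# never decreases; the price is lower precision (more false positives).
```

This complements the `class_weight='balanced'` approach used below, which shifts the effective threshold by reweighting the loss instead.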
# Running a loop to check different values of 'solver'
# all solvers can be used with 'l2'; only 'liblinear' and 'saga' work with both 'l1' and 'l2'
train_score = []
test_score = []
recall_score_list = []
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
for i in solver:
    model = LogisticRegression(random_state=42, penalty='l2', C=0.75, solver=i)  # changing values of solver
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    train_score.append(round(model.score(X_train, y_train), 3))
    test_score.append(round(model.score(X_test, y_test), 3))
    recall_score_list.append(round(recall_score(y_test, y_predict), 3))
print(solver)
print()
print(train_score)
print()
print(test_score)
print()
print(recall_score_list)
train_score = []
test_score = []
recall_score_list = []
solver = ['liblinear', 'saga']  # the solvers that work with 'l1'
for i in solver:
    model = LogisticRegression(random_state=42, penalty='l1', C=0.75, solver=i)  # changed penalty to 'l1'
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    train_score.append(round(model.score(X_train, y_train), 3))
    test_score.append(round(model.score(X_test, y_test), 3))
    recall_score_list.append(round(recall_score(y_test, y_predict), 3))
print(solver)
print()
print(train_score)
print()
print(test_score)
print()
print(recall_score_list)
## Accuracy and recall with 'l1' + 'liblinear' are slightly better than 'l2' with 'newton-cg'
# choosing liblinear with 'l1'
model = LogisticRegression(random_state=42,penalty='l1',solver='liblinear',class_weight='balanced') # changing class weight to balanced
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy", model.score(X_train, y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
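The choice of `penalty='l1'` also acts as implicit feature selection: L1 regularization drives some coefficients exactly to zero, while L2 only shrinks them. A self-contained illustration on synthetic data (the dataset and `C=0.1` here are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, only 5 informative, so there is plenty for l1 to prune
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_demo, y_demo)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X_demo, y_demo)
l1_zeros = int((l1.coef_ == 0).sum())
l2_zeros = int((l2.coef_ == 0).sum())
print("zero coefficients with l1:", l1_zeros)
print("zero coefficients with l2:", l2_zeros)
# l1 zeroes out uninformative features; l2 shrinks coefficients
# toward zero but essentially never to exactly zero.
```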
# Running a loop to check different values of 'C'
train_score = []
test_score = []
recall_score_list = []
C = [0.01, 0.1, 0.25, 0.5, 0.75, 1]
for i in C:
    model = LogisticRegression(random_state=42, penalty='l1', solver='liblinear', class_weight='balanced', C=i)  # changing values of C
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    train_score.append(round(model.score(X_train, y_train), 3))  # training accuracy for each C
    test_score.append(round(model.score(X_test, y_test), 3))     # testing accuracy for each C
    recall_score_list.append(round(recall_score(y_test, y_predict), 3))
print(C)
print()
print(train_score)
print()
print(test_score)
print()
print(recall_score_list)
#Therefore final model is
model = LogisticRegression(random_state=42,penalty='l1', solver='liblinear', class_weight='balanced',C=0.25)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy", model.score(X_train, y_train))
print()
print("Testing accuracy", model.score(X_test, y_test))
print()
print('Confusion Matrix')
draw_cm(y_test, y_predict)
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
cm = confusion_matrix( y_test,y_predict)
print(cm)
The campaign's goal is to have more people accept the personal loan, i.e. to minimize false negatives, so that the bank does not lose customers who would take a loan.
Types of error:
In this use case we need to reduce Type II error so as not to miss people who would take loans. Hence, recall is MORE important than precision here.
In the input data, the proportion of buyers is very small compared to non-buyers, so accuracy alone is not a meaningful performance criterion.
The right model for this use case must therefore minimize Type II error and achieve high recall (low false negatives).
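Concretely, recall = TP / (TP + FN), and it can be read straight off the confusion matrix. A small illustrative check with toy labels (not this notebook's actual predictions):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 10 actual positives, of which 9 are caught (1 false negative),
# plus 5 false positives among the 90 negatives
y_true = [1] * 10 + [0] * 90
y_hat = [1] * 9 + [0] * 1 + [0] * 85 + [1] * 5

# sklearn's binary confusion matrix flattens in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
manual_recall = tp / (tp + fn)  # 9 / (9 + 1) = 0.9
print(f"TP={tp}, FN={fn}, recall={manual_recall:.2f}")
assert manual_recall == recall_score(y_true, y_hat)
```

Note that the 5 false positives do not affect recall at all; they would only lower precision.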
The final model has a recall of 91%, up from the initial 32%, and false negatives are down from 107 to 13.
Having achieved the desired recall, the model can be deployed for practical use: the bank can now predict which customers are likely to be interested in personal loans and target them in upcoming campaigns.